
doc: add cluster manager reference architecture #1209

Draft

minaelee wants to merge 1 commit into canonical:main from minaelee:cluster-manager-architecture

Conversation

@minaelee (Contributor) commented Feb 2, 2026:

Add reference architecture documentation for MicroCloud Cluster Manager.

@github-actions bot added the Documentation label on Feb 2, 2026
@minaelee force-pushed the cluster-manager-architecture branch from 84f08c8 to 2496821 on February 4, 2026 at 15:43
@edlerd (Contributor) left a comment:

Excellent start. I have many thoughts and comments below. We can have a chat if you like, to clarify the open issues.


The MicroCloud Cluster Manager is a centralized tool that provides an overview of MicroCloud deployments. In its initial implementation, it provides an overview of resource usage and availability for all clusters. Future implementations will include centralized cluster management capabilities.

Cluster Manager stores the data from registered clusters in Postgres and Prometheus databases. This data can be displayed in the Cluster Manager UI, which also links to Grafana dashboards for each MicroCloud.
@edlerd (Contributor) commented:

> which also links to Grafana dashboards for each MicroCloud

This is a possible extension. By default, the COS stack is not available. So a user will deploy cluster manager and get the manager UI without links to Grafana.

@minaelee (Contributor Author) replied:

Does this update work, or would you prefer we did not mention Grafana at all?

> This data can be displayed in the Cluster Manager UI, which can be extended to link to Grafana dashboards for each MicroCloud.

Note: This information is from https://github.com/canonical/microcloud-cluster-manager/blob/main/ARCHITECTURE.md and likely should be updated there as well.

@edlerd (Contributor) replied:

Yes, suggestion sounds good to me. I'll take a note to update the architecture file.

Comment on lines +19 to +22
```{figure} ../images/cluster_manager_architecture.png
:alt: A diagram of Cluster Manager architecture
:align: center
```
@edlerd (Contributor) commented:

This diagram is from an earlier development environment. It is mostly correct, but some things have slightly changed.

@minaelee (Contributor Author) replied:

Is there an updated diagram, or can you let me know what has changed and I can update it?

@edlerd (Contributor) replied Feb 5, 2026:

We don't have an updated diagram yet. Things that have changed:

  • A single TCP load balancer instead of two, exposing two different domain names: one for the management-api and one for the cluster-connector.
  • The Postgres service, PG deployment, and volume claims are "just" one thing: the Postgres charm. The rest is internal detail of the PG charm; the diagram exposes too much detail of the PG charm internals, with assumptions that might be wrong.
  • Cert manager is to be replaced by a charm implementing the "certificates" charm interface. We might just change the label here.
  • K8s secrets/k8s config are to be replaced by a Juju config layer. Under the hood this is still true; I am not sure how to unify the levels of detail in the diagram to surface k8s internals versus charm/Juju internals.
  • management-api and cluster-connector live together in the same container. Each container runs those two processes, and there can be multiple containers to scale out (see the sketch after this list).
  • We might want to add the Canonical Observability Stack as an optional extension to the diagram, with Prometheus and Grafana.
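
A minimal sketch of the "two processes per container" point above, assuming a small Python entrypoint and assuming the binary names `management-api` and `cluster-connector` (both are assumptions for illustration; the real image layout and flags may differ):

```python
# Hypothetical container entrypoint: supervise both processes and exit
# if either stops, so the orchestrator restarts the container as a unit.
# Scale-out happens by running more containers, each with both processes.
import subprocess
import sys
import time

COMMANDS = [
    ["management-api"],      # hypothetical binary name
    ["cluster-connector"],   # hypothetical binary name
]

def main() -> int:
    procs = [subprocess.Popen(cmd) for cmd in COMMANDS]
    try:
        while True:
            for proc in procs:
                code = proc.poll()
                if code is not None:
                    # One process exited; return its code so the
                    # container stops and can be restarted.
                    return code
            time.sleep(1)
    finally:
        # Stop whichever process is still running before exiting.
        for proc in procs:
            if proc.poll() is None:
                proc.terminate()

if __name__ == "__main__":
    sys.exit(main())
```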

@edlerd (Contributor) replied:

I can create a task for myself to create an updated diagram.

That static external IP acts as the gateway to route user traffic to the appropriate Kubernetes load balancers.

TCP load balancers
: Two TCP load balancer services distribute traffic to the Management API and Cluster Connector deployments without terminating TLS. Instead, TLS termination is handled directly within each deployment application. This approach is particularly crucial for the Cluster Connector deployment, as it relies on mutual TLS (mTLS) authentication for secure communication.
@edlerd (Contributor) commented:

We are using a single Traefik instance to handle the incoming requests; there are no longer two load balancers.

@minaelee (Contributor Author) replied:

Fixed to:

> A TCP load balancer (using a Traefik instance) distributes traffic to the Management API and Cluster Connector deployments without terminating TLS.
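
As context for TLS not being terminated at the load balancer, here is a minimal sketch, using only the Python standard library, of application-side TLS termination that also requires a client certificate (the mTLS case the doc text describes for the Cluster Connector). The certificate paths and port are placeholders, not the actual implementation:

```python
# Illustrative sketch only: an application-terminated TLS listener that
# requires client certificates (mutual TLS). Paths and port are placeholders.
import socket
import ssl

context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.load_cert_chain(certfile="server.crt", keyfile="server.key")
# Requiring and verifying a client certificate is what makes this mutual TLS.
context.verify_mode = ssl.CERT_REQUIRED
context.load_verify_locations(cafile="trusted-clients.pem")

with socket.create_server(("0.0.0.0", 8443)) as server:
    with context.wrap_socket(server, server_side=True) as tls_server:
        conn, addr = tls_server.accept()   # TLS handshake happens here
        peer_cert = conn.getpeercert()     # identity of the connecting cluster
        print("mTLS connection from", addr, "subject:", peer_cert.get("subject"))
        conn.close()
```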

Comment on lines +33 to +34
Certificate manager
: Manages TLS/SSL certificates for secure communication within the Kubernetes cluster. It stores secrets in Kubernetes to be used by various components. The certificates are used by both the Management API and Cluster Connector deployments for HTTPS encryption.
@edlerd (Contributor) commented:

We now rely on a charm that implements the certificates interface to provide certificates. This can be the self-signed-certificates charm, as suggested in the readme. We do not rely on the certificate manager k8s app anymore.

@minaelee (Contributor Author) replied:

Should the Certificate manager section in lines 33-34 above be removed entirely?

@edlerd (Contributor) replied:

I think we can remove it, yes.

Comment on lines +39 to +40
Persistent Volume (PV) and Persistent Volume Claim (PVC)
: The Persistent Volume is the storage resource provisioned for the Postgres deployment. The Persistent Volume Claim is the request for storage by the Postgres deployment to ensure data persistence.
@edlerd (Contributor) commented:

We rely on the Canonical Postgres charm. How that charm does persistent storage is outside our control.

@minaelee (Contributor Author) replied:

What information should we provide in this section instead, or should we remove it entirely?

@edlerd (Contributor) replied:

I think we can remove it, yes.

(ref-cluster-manager-architecture-management-ui)=
### UI

The Management API deployment handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens, as well as approve or reject join requests.
@edlerd (Contributor) commented:

Suggested change:
- The Management API deployment handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens, as well as approve or reject join requests.
+ The Management API deployment handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens.

@edlerd (Contributor) added:

We can expand here: we serve warnings and high-level metric insights, as well as a list of all registered clusters.

@minaelee (Contributor Author) replied:

> The Management API handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens. The UI also serves warnings and metric insights on a high level.

I added this, but "on a high level" could bear more explanation. Do you mean through optional extension with Grafana, or something more/else?

@edlerd (Contributor) replied Feb 5, 2026:

High level means aggregates of instances and MicroCloud cluster members, like the number of instances and their status distribution (how many are started/stopped, and so on). If the cluster manager is extended with the COS/Grafana stack, then Grafana indeed holds detailed information about each instance in every cluster.
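
A toy sketch of the kind of high-level aggregate described here; the data shape and field names are invented for illustration, since the real schema is not shown in this PR:

```python
# Toy example: per-status and per-cluster aggregates like those the UI shows.
# The instance records and field names are invented for illustration.
from collections import Counter

instances = [
    {"cluster": "microcloud-a", "name": "vm1", "status": "Running"},
    {"cluster": "microcloud-a", "name": "vm2", "status": "Stopped"},
    {"cluster": "microcloud-b", "name": "vm3", "status": "Running"},
]

total = len(instances)
status_distribution = Counter(i["status"] for i in instances)
per_cluster = Counter(i["cluster"] for i in instances)

print(f"{total} instances across {len(per_cluster)} clusters")
print("Status distribution:", dict(status_distribution))
# Detailed per-instance metrics would live in Grafana only if the optional
# COS stack is deployed alongside the cluster manager.
```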

- mTLS authentication check against the matched certificate
- Store and overwrite data in the `remote_cluster_details` table

To avoid overwhelming the Cluster Connector deployment, the status endpoints are rate limited. The response sent to the originating cluster includes a delay period (in seconds) that must pass before the next status signal request.
@edlerd (Contributor) commented:

All endpoints are rate limited, not just this one.

@minaelee (Contributor Author) replied Feb 4, 2026:

Updated to:

> (ref-cluster-manager-architecture-rate-limited)=
> ## Rate limited endpoints
>
> To avoid overwhelming the Cluster Manager, all its endpoints are rate limited. When any endpoint receives a request from a cluster, the response from Cluster Manager includes a delay period (in seconds) that must pass before the next request to that endpoint.

Or did you mean all endpoints for the Cluster Connector deployment only?

Also: do you want to change the term "Cluster Connector deployment" to "Cluster Connector" (like with "Management API deployment" to "Management API") or does it make sense to keep the word "deployment" here?

@edlerd (Contributor) replied:

I think the suggestion is slightly confusing. We have rate limiting in place to avoid overwhelming the cluster manager, yes.

The functionality to signal to the MicroCloud in a response when it should call in again is unrelated to the rate limiting, though.
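
To illustrate the distinction, a hypothetical client-side sketch: the MicroCloud honors a server-suggested delay between status calls, while server-side rate limiting remains a separate safeguard. The endpoint URL and the `delay` field name are assumptions, not the real API:

```python
# Illustrative sketch only: a reporting loop on the MicroCloud side that
# waits for a server-suggested delay between calls. The URL and the "delay"
# field are placeholders; the real client would also present its client
# certificate for mTLS, which is omitted here for brevity.
import json
import time
import urllib.request

STATUS_URL = "https://cluster-manager.example.com/1.0/status"  # placeholder

def send_status(payload: dict) -> dict:
    req = urllib.request.Request(
        STATUS_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Server-side rate limiting (e.g. an HTTP 429) is a separate concern
    # from the delay hint returned in a successful response.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

while True:
    response = send_status({"status": "ok"})
    # The server tells the cluster when to call in again; default to 60s.
    time.sleep(response.get("delay", 60))
```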

Signed-off-by: Minae Lee <minae.lee@canonical.com>
@minaelee force-pushed the cluster-manager-architecture branch from 2496821 to a2c04aa on February 4, 2026 at 23:36

Labels

Documentation (Documentation needs updating)

2 participants